Update osmGenus.txt #675

Open · wants to merge 1 commit into base: modified

davidpnewton

Alphabetise, modernise taxonomy and increase number of entries

@Helium314
Owner

Is the modernised data still coming from OSM?

@davidpnewton
Author

When I refer to "modernised" I mean bringing it in line with correct modern taxonomic standards. For example the inclusion of Aria and Scandosorbus is precisely that as they have been taken out of the complicated gestalt that Sorbus was and to some extent still is. Aria edulis and Scandosorbus intermedia are extremely common trees. Another example is × Hesperotropsis. That's a hybrid genus which contains × Hesperotropsis leylandii: it of the giant hedges causing neighbourly disputes the world over. Formerly that was in × Cupressocyparis, and it's an extremely common plant.

I've also made sure that all of the species mentioned in the species text file have their corresponding genus in this file. There was no obvious rhyme or reason for the genus being in there or not being in there in that respect.
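For illustration, a coverage check of this kind could be sketched roughly as below. This is not the script used for the PR. It assumes each file holds one name per line, that a hybrid genus is written with a leading `×` followed by a space, and the helper names `genus_of` and `missing_genera` are hypothetical.

```python
def genus_of(species_name: str) -> str:
    """Return the genus part of a name, keeping a leading hybrid sign."""
    tokens = species_name.split()
    if not tokens:
        return ""
    # Assumption: a nothogenus is written "× Genus epithet".
    if tokens[0] == "×" and len(tokens) > 1:
        return "× " + tokens[1]
    return tokens[0]


def missing_genera(species_lines, genus_lines):
    """List genera that appear in the species list but not in the genus list."""
    known = {line.strip() for line in genus_lines if line.strip()}
    wanted = {genus_of(line) for line in species_lines if line.strip()}
    return sorted(wanted - known)


species = ["Aria edulis", "Sorbus aucuparia", "× Hesperotropsis leylandii"]
genera = ["Sorbus"]
print(missing_genera(species, genera))
```

An empty result would mean every species entry has its genus listed.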

@Helium314
Owner

The osm*.txt files should only contain what is actually used in OSM. With SCEE I want to avoid pushing new / alternative keys and values as much as possible.
The current files were created from existing OSM data (taginfo), with (rather coarse) manual checks like removing invalid or duplicate values, preferring the more used version.
So a genus is missing from the file when it is absent from, or very rare in, OSM data (as of 2 years ago).
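As a rough sketch of the "prefer the more used version" step, something like the following would do it. It is not the actual procedure used to build the files. The input is assumed to be a mapping from tag value to usage count (for example, exported from taginfo), the counts below are invented, and `pick_most_used` and `min_count` are hypothetical names.

```python
from collections import defaultdict


def pick_most_used(counts, min_count=1):
    """Keep one spelling per value (the most used), dropping rare values."""
    groups = defaultdict(list)
    for value, count in counts.items():
        # Variants that differ only in case or spacing share one key.
        key = " ".join(value.split()).casefold()
        groups[key].append((count, value))
    chosen = []
    for variants in groups.values():
        count, value = max(variants)  # highest count wins
        if count >= min_count:
            chosen.append(value)
    return sorted(chosen)


# Invented counts, for illustration only.
usage = {"Quercus": 100, "quercus": 7, "Tilia ": 3, "Tilia": 50, "Aria": 2}
print(pick_most_used(usage, min_count=5))
```

The threshold is what keeps rarely used values out of the list, which is the behaviour described above.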

If it's really ok for the community to use the changed values (more relevant for the species file), then I'm ok with overhauling the lists, though maybe using different file names.

(Sorry for the late reply.)

@davidpnewton
Author

There is a degree of taxonomic choice in those files. As mentioned in my previous reply, I shifted things over to modern taxonomy for things like Aria edulis rather than Sorbus aria. My main source is Plants of the World, which is a fairly definitive resource on plant taxonomy. Sorbus aria has just over 5,200 occurrences, whereas Aria edulis has only 62 occurrences. My eventual aim is to get rid of all occurrences of Sorbus aria and replace them with Aria edulis.

Another thing that I've done is to substitute the multiplication sign × for the letter x in the species names for hybrids. That's because the × is distinct from the letter x and indicates either a nothospecies or nothogenus (hybrid species or hybrid genus). There are plenty of instances of the × being used in OSM rather than the x. Platanus × hispanica for example has just over 19,000 uses as opposed to Platanus x hispanica at just under 29,000 uses. The former is correct taxonomy and the latter is incorrect taxonomy and again I eventually intend to get rid of all of the latter and replace it with the former. Again taxonomic choice and judgement, but I think reasonable choice and judgement and also justifiable choice and judgement.
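The substitution described here could be sketched as follows. This is a minimal example, not the method used for the PR. It assumes the hybrid marker always stands as a separate token (`x` or `X` surrounded by spaces), so attached forms without a space are left untouched. The function name `normalise_hybrid_sign` is hypothetical.

```python
def normalise_hybrid_sign(name: str) -> str:
    """Replace a standalone letter x with the multiplication sign ×."""
    # Splitting on whitespace also collapses repeated spaces.
    tokens = ("×" if token in ("x", "X") else token for token in name.split())
    return " ".join(tokens)


print(normalise_hybrid_sign("Platanus x hispanica"))
print(normalise_hybrid_sign("Taxus baccata"))
```

Because only whole tokens are replaced, the letter x inside a word such as Taxus is never touched.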

The third main thing I've done is get rid of all cultivars, forms, varieties and subspecies in the species file. The species key isn't necessarily strictly only for Linnean binomial names, but I think that the simplest thing for plants is to keep it to that since there are multiple things possible below the species level for plants. Animals only have subspecies as a possibility, so using that in the species tag is unambiguous. However plants could have a subspecies, a variety or a form for naturally occurring things or a cultivar (portmanteau of cultivated variety) for an artificially occurring thing. To me those more belong in the taxon field as they are very specific, difficult for a non-specialist to tell apart and it's also useful to be able to filter and search by the Linnean binomial name only for a lot of taxa.
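To make the idea concrete, reducing a name to its binomial could look roughly like the sketch below. It is only an illustration under simple assumptions: rank markers are limited to the abbreviations listed in `RANK_MARKERS`, cultivar names start with a quote character, and trade names follow the species epithet. `to_binomial` and `RANK_MARKERS` are hypothetical names.

```python
RANK_MARKERS = {"subsp.", "ssp.", "var.", "f."}  # assumed abbreviations


def to_binomial(name: str) -> str:
    """Drop everything below species level, keeping any hybrid sign."""
    kept = []
    names_seen = 0
    for token in name.split():
        if token == "×":
            kept.append(token)  # hybrid sign is kept but not counted
            continue
        if token in RANK_MARKERS or token[0] in "'‘\"":
            break  # infraspecific rank or cultivar name starts here
        kept.append(token)
        names_seen += 1
        if names_seen == 2:  # genus and species epithet collected
            break
    return " ".join(kept)


print(to_binomial("Thuja occidentalis 'Janed Gold'"))
print(to_binomial("Platanus × hispanica"))
```

A name consisting of a genus plus a cultivar only would come back as the bare genus, so such entries would still need a manual look.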

The other issue is that infraspecific names come with some nasty gotchas in terms of the nomenclature requirements for being correct and unambiguous. For a cultivar, for example, the cultivar name needs to be enclosed in single quotes. Trademarked names also need to be watched out for, as they can be applied to multiple different cultivars and thus cause confusion. Consider Thuja occidentalis GOLDEN SMARAGD vs Thuja occidentalis 'Janed Gold' vs Thuja occidentalis 'Smaragd'. The first two are synonyms of each other in practical terms, with the first being a trade name and the second being a cultivar. However, the third is also a cultivar, and it is distinct in appearance from the first two! To be properly unambiguous, the cultivar name should be used in preference to any trade name.

I haven't necessarily preferred the most used version, although the most used version and the correct version correspond a great deal of the time. The Platanus example discussed above is a good case of that. Acerifolia is an obsolete description of the tree as opposed to hispanica. Therefore despite the fact that the two forms almost balance each other out in their use in the database I think that preferring hispanica is a good idea.

Some of the entries I have put in because I have used them in OSM in my own edits. Those are a minority. Hope this further clarifies what I was trying to achieve with these revised species and genus files.

@mnalis
Collaborator

mnalis commented Jan 1, 2025

Thanks @davidpnewton for the detailed reply (and the effort that you put into this PR!), but I must confess it is a little over my head (even reminds me of XKCD #2501 😅 )

However, if I understand correctly, the list of tag values has been updated from some authoritative source external to OSM (i.e. "Plants of the World"), right?

If that is so, as noted in #675 (comment), in SCEE (just like in SC) we try not to promote (i.e. offer as presets) new OSM tagging (i.e. key=value pairs which are not currently being used in main OSM database in reasonable numbers).

That policy has nothing to do with whether some tagging is a good idea, or more "correct", or more accepted in the scientific community (e.g. only 62 "Aria edulis", which is allegedly better than those 5200 "Sorbus aria"), but depends on OSM community consensus instead.
There are several ways how that consensus can be determined (and influenced):

  1. statistics (i.e. if there are 5200 "Sorbus aria" but only 62 "Aria edulis", then it seems the OSM community prefers the former). This is AFAICT what SCEE has defaulted to (as it is the simplest). However, it can be problematic (e.g. if the community has just decided to move from an old tag to a new one)
  2. formal OSM proposal which identifies and votes on a list of common values
  3. or at the very least OSM community discussion in tagging category which shows community acceptance of those suggestions/changes

In cases (2) and (3), whoever initiates the discussion should also add links to that discussion here in this PR, as well as on the appropriate wiki Talk page (e.g. https://wiki.openstreetmap.org/wiki/Talk:Key:species for the species=* tag), so interested parties can follow.

@Helium314
Owner

> I shifted things over to modern taxonomy

> The former is correct taxonomy and the latter is incorrect taxonomy and again I eventually intend to get rid of all of the latter and replace it with the former.

That's an important thing here. As @mnalis said, what is used in SCEE basically depends on community consensus, while the updated lists are your personal work.
Personally I don't have a problem with your overhaul, and the way you argue I think it makes sense. But SCEE does not let you freely choose the values, and in this sense is pushing whatever values are in the lists. The current SCEE approach of just taking the most used variant is not always correct, but it's very easy to argue for.
And I really don't want to spend my time on complaints about SCEE tagging, or on discussions in OSM forum for various reasons...

> or at the very least OSM community discussion in tagging category which shows community acceptance of those suggestions/changes

I think that would be a very useful thing. Might also help others who are interested in species tagging, but don't use SCEE.


3 participants